
[codex] Optimize Vesuvius training runtime and validation #802

Open

giorgioangel wants to merge 3 commits into merge-ink-pipelines from codex/vesuvius-runtime-optimizations

Conversation

@giorgioangel
Member

What changed

  • optimized Vesuvius training/runtime behavior in the vesuvius stack only
  • kept distributed validation with coherent rank-0 W&B aggregation and rotating preview selection
  • kept validation GIFs enabled while improving preview patch/slice selection
  • moved active run outputs and local W&B data to /ephemeral
  • added NUMA pinning support for CUDA DDP ranks and inherited worker affinity
  • added a binary EDT (Euclidean distance transform) fast path for surface dilation while keeping scipy fallback semantics for non-binary cases
  • kept persistent derived-target caching for validation and memory-only caching for training
  • preserved training-time skeletonization after augmentation
  • added regression tests for preview selection and dilation behavior
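
The binary EDT fast path mentioned above can be sketched as follows; `dilate_surface` and its signature are illustrative assumptions, not the PR's actual function:

```python
import numpy as np
from scipy import ndimage

def dilate_surface(mask: np.ndarray, radius: float) -> np.ndarray:
    """Dilate a surface mask by `radius` voxels (illustrative sketch)."""
    values = np.unique(mask)
    if mask.dtype == bool or set(values.tolist()) <= {0, 1}:
        fg = mask.astype(bool)
        if not fg.any():
            return fg
        # Fast path: distance from every voxel to the nearest foreground
        # voxel, thresholded at the dilation radius. Equivalent to binary
        # dilation with a Euclidean ball of that radius, in one EDT pass.
        dist = ndimage.distance_transform_edt(~fg)
        return dist <= radius
    # Fallback: grey dilation keeps scipy semantics for non-binary labels.
    size = int(2 * radius) + 1
    return ndimage.grey_dilation(mask, size=(size,) * mask.ndim)
```

Thresholding the distance transform produces the same result as a ball-structuring-element binary dilation, but is typically much faster for large radii since the EDT cost does not grow with the radius.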

Why

  • the remote H100 canaries showed real validation wins from distributed validation and ps128 throughput gains from batch_size: 24
  • the remaining runtime issues were dominated by training/validation pipeline overhead and slow artifact I/O rather than connected-components metrics
  • the changes focus on the highest-likelihood real gains without staging operational artifacts or secrets

Validation

Ran on the remote repo under /home/ubuntu/villa/vesuvius:

  • .venv/bin/python3 -m py_compile on the touched Python files
  • PYTHONPATH=src .venv/bin/pytest tests/models/test_validation_preview.py tests/models/test_zarr_dataset_dilation.py
  • live remote canary profiling for ps128 and ps256
  • NUMA affinity verified on running rank processes via /proc/<pid>/status
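
The /proc/<pid>/status check in the last bullet can be scripted; this helper is an illustrative sketch, not part of the PR:

```python
def cpus_allowed(pid=None) -> str:
    """Return the Cpus_allowed_list value (e.g. '0-7' or '0,2,4')
    for a process, read from /proc/<pid>/status on Linux."""
    path = f"/proc/{pid if pid is not None else 'self'}/status"
    with open(path) as f:
        for line in f:
            if line.startswith("Cpus_allowed_list:"):
                return line.split(":", 1)[1].strip()
    raise RuntimeError(f"Cpus_allowed_list not found in {path}")
```

A NUMA-pinned DDP rank should report a CPU range local to its node rather than the full machine range, and its data-loader workers should inherit the same mask.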

Notes

  • PR intentionally excludes root-level operational artifacts and configs such as .patches_cache/, _codex_backup_20260331/, bench_edt_vs_scipy.py, and the root ps128_medial_default.yaml / ps256_medial_default.yaml
  • PR targets merge-ink-pipelines because the remote source branch is based on that branch rather than main

@vercel

vercel bot commented Mar 31, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project: scrollprize-org | Deployment: Ignored | Actions: Preview | Updated (UTC): Mar 31, 2026 10:15am


@giorgioangel giorgioangel marked this pull request as ready for review March 31, 2026 10:51
@giorgioangel giorgioangel requested a review from jrudolph as a code owner March 31, 2026 10:51
@giorgioangel giorgioangel requested a review from bruniss March 31, 2026 10:51

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9f706a32f


Comment on lines +494 to +495
# Deterministic skeleton targets are generated in the dataset before augmentation.
return None


P1: Restore validation skeleton transform for non-Zarr datasets

create_validation_transforms now always returns None, but MutexAffinityDataset._initialize_transforms still invokes this helper when skeleton losses are configured. In that path, validation no longer produces *_skel tensors, and BaseTrainer._compute_loss_value will call skeleton losses without the required skel argument, which raises for DC_SkelREC_and_CE_loss/SoftSkeletonRecallLoss. This breaks mutex-affinity validation whenever skeleton-supervised losses are enabled.
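
One shape the suggested fix could take is gating the early return on the dataset type; all names below (`SkeletonTransform`, `dataset_is_zarr`) are hypothetical stand-ins, not the repository's API:

```python
class SkeletonTransform:
    """Stand-in for a transform that derives *_skel tensors from targets.
    Hypothetical; the real class name in the PR is unknown."""
    def __call__(self, sample: dict) -> dict:
        sample = dict(sample)
        for key in [k for k in sample if k.startswith("target")]:
            # Placeholder skeletonization: real code would thin the mask.
            sample[f"{key}_skel"] = sample[key]
        return sample

def create_validation_transforms(skeleton_losses: bool, dataset_is_zarr: bool):
    """Return None only when skeleton targets are produced elsewhere."""
    if not skeleton_losses:
        return None
    if dataset_is_zarr:
        # Zarr datasets emit deterministic skeleton targets pre-augmentation.
        return None
    # Non-Zarr paths (e.g. mutex-affinity) still need *_skel at validation.
    return SkeletonTransform()
```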


Comment on lines +188 to +190
cache_key = self._cache_key(patch_info, target_key, ignore_value)
cached = self._cache_get(cache_key)
if cached is not None:


P1: Disable skeleton cache for augmented training samples

The new cache key is based only on static patch_info (volume/position/patch size), but this transform is appended after stochastic augmentations in the training pipeline. That means repeated patches can reuse a cached skeleton computed for a different augmented variant, so {target}_skel can diverge from the current target tensor and silently corrupt skeleton-supervised training on ZarrDataset runs.
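
The collision described here can be avoided with a content-based key; this is an illustrative sketch (the function name and `patch_info` shape are assumptions, not the PR's code):

```python
import hashlib
import numpy as np

def skeleton_cache_key(patch_info: dict, target: np.ndarray) -> str:
    """Cache key that distinguishes augmented variants of the same patch.

    Keying on patch_info alone (volume/position/patch size) would let two
    differently augmented copies of one patch share a skeleton; folding a
    digest of the post-augmentation target bytes into the key prevents that.
    """
    h = hashlib.sha1()
    h.update(repr(sorted(patch_info.items())).encode())
    h.update(np.ascontiguousarray(target).tobytes())
    return h.hexdigest()
```

The trade-off is that cache hits only occur when the exact augmented tensor repeats, so for heavily randomized training pipelines it may be simpler to disable the cache entirely, as the comment title suggests.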

